Effects of COVID-19 pandemic on hospital-acquired infections

BMIN503/EPID600 Final Project

Author

Kevin Mears


1 Overview

The goal of my final project is to determine how the COVID-19 pandemic and the surge in hospital visits, changes in cleaning/hygiene, and people staying at home affected the incidence of hospital-associated infections. I identified the 2020 data set from the CDC National Hospital Care Survey (NHCS) which contains 132,694 inpatients and 388,753 emergency department visits. It contains information including patient age, sex, the month they were discharged, and their diagnoses. I will use this data to determine how the incidence of hospital-associated infections changed month-to-month in 2020 and whether there are correlations with other parameters (e.g. COVID-19 diagnosis).

I spoke with Dr. Joseph Zackular (Assistant Professor of Pathology and Laboratory Medicine at CHOP) who is a bacteriologist and an expert on C. difficile, the most common hospital-acquired infection. He suggested I stratify the dataset based on sex and age as demographics that might affect hospital infections. He suggested I look at bacterial pneumonia as a positive control as it is commonly associated with COVID co-infection.

I also spoke with Dr. Kyle Bittinger (Bioinformatics Laboratory Director, CHOP Microbiome Center) who gave suggestions on the code and analysis. For example, he suggested I adjust for other variables (sex, age, etc.) for analyzing the association between COVID-19 and secondary infections. He also suggested I expand on the change over time in age proportions in the bacterial pneumonia as that is the most interesting finding.

Final Project Github Repo

2 Introduction

Hospital-acquired infections (HAI) are infections acquired after hospital admission. It includes catheter-associated urinary tract infections, central line-associated blood stream infections, surgical site infections, ventilator-associated and hospital-acquired pneumonia, and Clostridioides difficile infections. The risk of HAIs depend on many factors, including the facility’s disinfection and infection prevention practices, the patient’s immune status, length of stay in the hospital, co-morbitities, ventilator support, and use of invasive procedures. Receipt of antibiotics is one of the major risk factors for developing Clostridioides difficile infection and other multi-drug resistant bacterial infections (e.g. Vancomycin resistant Enterococcus or Methicillin resistant Staphylococcus aureus). According to the CDC, about 4% of hospitalized patients experience at least one HAI, with an estimated 648,000 cases in 2011; these are dominated by pneumonia, surgical site infections, gastrointestinal infections, UTIs, and bloodstream infections. Clostidioides difficile infection is the most common cause of hospital acquired infections and can cause severe, potentially life-threatening colitis. Common causes of hospital-associated and ventilator-associated pneumonia are Staphylococcus aureus, Pseudomonas aeruginosa, E. coli, and Klebsiella penumoniae. Common causes of catheter-associated UTIs are Enterococcus, Staphylcoccus aureus, Pseudomonas, Proteus, Klebsiella, and Candida. Common causes of surgical site infections include Staphylococcus aureus, other Staphylococcus, Enterococcus, E. coli, Pseudomonas aeruginosa, Enterobacter, and Klebsiella.

The COVID-19 pandemic led to a surge in hospital visits, overwhelming our healthcare system and led to over a million deaths. Coincidentally, disinfection practices intensified in public facilities including hospitals. Many causes of hospital-associated infections, including C. difficile, MRSA, VRE, norovirus are difficult to disinfect using traditional cleaning methods. Additionally, antibiotics exposure increased during the pandemic. The present study aims to determine how the COVID-19 pandemic affected the incidence of hospital-associated infections in 2020.

This is an interdisciplinary question because it requires an understanding of microbiology and the various factors that affect pathogenesis, epidemiology to appreciate how disease is spread in a hospital setting, and data science for bioinformatics analysis of hospital survey data.

Sources:

  • Monegro et al. Hospital-Acquired Infections. (2023). In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing. https://www.ncbi.nlm.nih.gov/books/NBK441857/

  • Dewey et al. Increased Use of Disinfectants During the COVID-19 Pandemic and Its Potential Impacts on Health and Safety. (2021) Acs Chem Health Safety.

  • Dancer, S.J. Controlling Hospital-Acquired Infection: Focus on the Role of the Environment and New Technologies for Decontamination. (2014) Clin Microbiol Rev.

3 Methods

I used the 2020 data set from the CDC National Hospital Care Survey (NHCS) to answer my question. I first imported the data set and determined the variables available. The useful variables in the data were the age, sex, length of stay, discharge month, and diagnosis.

library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("ggplot2")
library("cowplot")

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp
library("forcats") # for fct_relevel

Inpatient data set

# Read in NHCS 2020 inpatient Public Use File R Dataset
#https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHCS/2020/R/nhcs2020ip_r.rds
url <- "https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHCS/2020/R/nhcs2020ip_r.rds"
nhcs2020ip <- read_rds(url)
# inpatient data set
#pull out variables list
var_2020 <- nhcs2020ip$variables

#select useful columns
var_2020_select <- select(var_2020, 1:71)

#variable names 
varnames_2020 <- colnames(var_2020_select)

#pull out primary diagnosis (DX1)
diag1_2020 <- var_2020_select$DX1

#determine the number of primary C. diff infections (ICD-10 diagnosis code: A047)
primaryCdiff_2020 <- sum(diag1_2020 == "A047", na.rm = TRUE)

I next determined the proportion of patients that stayed only for 1 day.

# count the number of patients that stayed for only 1 day
var_2020_LOS_1 <- var_2020_select |>
  filter(LOS == 1) |>
  summarize(count = n()) |>
  pull(count)

ip_total_count <- var_2020_select |>
  group_by(LOS) |>
  summarize(count = n()) |>
  summarize(sum(count))

prop_ip_routine <- var_2020_LOS_1 / ip_total_count
prop_ip_routine
  sum(count)
1  0.1610849
  • 16% of inpatient visits stayed only 1 day
#combine all diagnosis
alldiag_2020 <- var_2020_select |>
  pivot_longer(
    cols = c(DX1:DX30), 
    values_to = "DX"
  )

#all COVID infections
all_covid <- sum(alldiag_2020 =="U071", na.rm = TRUE)
all_covid
[1] 6452
#all C.diff infection
allCdiff <- sum(alldiag_2020 == "A047", na.rm = TRUE)
allCdiff
[1] 992
  • There were 6452 total COVID-19 diagnosis.

  • There were 992 total C. difficile infection

I assigned the description of the diagnosis (ICD-10 code description) to the data set to better understand the patients in the data set.

# Add descriptions to diagnosis
# Read the lines from the file
lines <- readLines("./2021-code-descriptions-tabular-order/icd10cm_codes_2021.txt")

# Split each line by 2 or more spaces
split_lines <- strsplit(lines, "\\s{2,}")

# Find the maximum number of columns
max_length <- max(sapply(split_lines, length))

# Pad shorter rows with NA
padded_lines <- lapply(split_lines, function(x) {
  length(x) <- max_length  # Extend the length
  return(x)  # Returns the padded line
})

# Convert the list to a data frame
icd_code <- do.call(rbind, lapply(padded_lines, function(x) as.data.frame(t(x), stringsAsFactors = FALSE)))

# Optionally, set column names if needed
colnames(icd_code) <- c("DX", "Description")

alldiag_desc <- left_join(alldiag_2020, icd_code)
Joining with `by = join_by(DX)`
# diagnosis arranged by number of cases
alldiag_desc |>
  group_by(Description) |>
  summarize(count = n()) |>
  arrange(desc(count))
# A tibble: 3,814 × 2
   Description                                              count
   <chr>                                                    <int>
 1 <NA>                                                   3217815
 2 Essential (primary) hypertension                         30922
 3 Hyperlipidemia, unspecified                              27947
 4 Acute kidney failure, unspecified                        18190
 5 Gastro-esophageal reflux disease without esophagitis     17444
 6 Single live birth                                        12200
 7 Encounter for immunization                               11723
 8 Major depressive disorder, single episode, unspecified   11503
 9 Anxiety disorder, unspecified                            11406
10 Hypo-osmolality and hyponatremia                         10649
# ℹ 3,804 more rows

I next filtered patients who were diagnosed with COVID-19 and hospital-associated infections and subsequently plotted by the variables (sex, age, length of stay, discharge month).

#subset COVID-19 patients
covid_patients <- filter(alldiag_2020, DX =="U071")
covid_patients <- covid_patients |>
  filter(!is.na(AGE)) |>
  filter(!is.na(LOS_30DAYS))
#re-level age
covid_patients <- covid_patients |>
  mutate(age.simplified = cut(AGE, 
                              breaks = c(0, 5, 10, 15, 20, seq(30, 100, by = 10)), 
                              right = FALSE, 
                              labels = c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))

#count(covid_patients, age.simplified)
#count(covid_patients, AGE)
#count(covid_patients, SEX)
#count(covid_patients, LOS) #length of stay
#count(covid_patients, LOS_30DAYS) #greater than 30 days
#count(covid_patients, DISCHARGE_MONTH)
#count(covid_patients, DISCHARGE_STATUS)
#visualization
ggplot(covid_patients, aes(x = SEX)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("COVID-19 cases by sex") +
  theme_bw()

ggplot(covid_patients, aes(x = age.simplified)) +
  geom_bar() +
  labs(x = "Age", y = "Count") +
  ggtitle("COVID-19 cases by age") +
  theme_bw()

ggplot(covid_patients, aes(x = LOS)) +
  geom_bar() +
  labs(x = "Length of stay (up to 14 days)", y = "Count") +
  ggtitle("COVID-19 cases by length of stay") +
  theme_bw()

#greater than 15 binned in data

ggplot(covid_patients, aes(x = LOS_30DAYS)) +
  geom_bar() +
  labs(x = "Length of stay greater than 30 days", y = "Count") +
  ggtitle("COVID-19 cases by length of stay greater than 30 days") +
  theme_bw()

ggplot(covid_patients, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  ggtitle("COVID-19 cases by discharge month") +
  theme_bw()

#subset bacterial pneumonia patients
bactpneumo_patients <- filter(alldiag_2020, DX %in% c("J13", "J14", "J150", "J151", "J1520", "J15211", "J1529", "J153", "J154", "J155", "J1561", "J1569", "J157", "J158", "J159", "J160"))
#J13     Pneumonia due to Streptococcus pneumoniae
#J14     Pneumonia due to Hemophilus influenzae
#J150    Pneumonia due to Klebsiella pneumoniae
#J151    Pneumonia due to Pseudomonas
#J1520   Pneumonia due to staphylococcus, unspecified
#J15211  Pneumonia due to Methicillin susceptible Staphylococcus aureus
#J15212  Pneumonia due to Methicillin resistant Staphylococcus aureus
#J1529   Pneumonia due to other staphylococcus
#J153    Pneumonia due to streptococcus, group B
#J154    Pneumonia due to other streptococci
#J155    Pneumonia due to Escherichia coli
#J1561   Pneumonia due to Acinetobacter baumannii
#J1569   Pneumonia due to other Gram-negative bacteria
#J157    Pneumonia due to Mycoplasma pneumoniae
#J158    Pneumonia due to other specified bacteria
#J159    Unspecified bacterial pneumonia
#J160    Chlamydial pneumonia
#re-level age
bactpneumo_patients <- bactpneumo_patients |>
  mutate(age.simplified = cut(AGE, 
                              breaks = c(0, 5, 10, 15, 20, seq(30, 100, by = 10)), 
                              right = FALSE, 
                              labels = c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))

#count(bactpneumo_patients, age.simplified)
#count(bactpneumo_patients, SEX)
#count(bactpneumo_patients, LOS) #length of stay
#count(bactpneumo_patients, LOS_30DAYS) #greater than 30 days
#count(bactpneumo_patients, DISCHARGE_MONTH)
#count(bactpneumo_patients, DISCHARGE_STATUS)
#visualization
ggplot(bactpneumo_patients, aes(x = SEX)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("Bacterial penumonia cases by sex") +
  theme_bw()

ggplot(bactpneumo_patients, aes(x = age.simplified)) +
  geom_bar() +
  labs(x = "Age", y = "Count") +
    ggtitle("Bacterial penumonia cases by age") +
  theme_bw()

ggplot(bactpneumo_patients, aes(x = LOS)) +
  geom_bar() +
  labs(x = "Length of stay (up to 14 days)", y = "Count") +
    ggtitle("Bacterial penumonia cases by length of stay") +
  theme_bw()

#greater than 15 binned in data

ggplot(bactpneumo_patients, aes(x = LOS_30DAYS)) +
  geom_bar() +
  labs(x = "Length of stay greater than 30 days", y = "Count") +
  ggtitle("Bacterial penumonia cases by length of stay greater than 30 days") +
  theme_bw()

ggplot(bactpneumo_patients, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
    ggtitle("Bacterial penumonia cases by discharge month") +
  theme_bw()

#subset C.diff patients
Cdiff_patients <- filter(alldiag_2020, DX == "A047")
#plot number of cases by demographics
#re-level age
Cdiff_patients <- Cdiff_patients |>
  mutate(age.simplified = cut(AGE, 
                              breaks = c(0, 5, 10, 15, 20, seq(30, 100, by = 10)), 
                              right = FALSE, 
                              labels = c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))

#count(Cdiff_patients, age.simplified)
#count(Cdiff_patients, SEX)
#count(Cdiff_patients, LOS) #length of stay
#count(Cdiff_patients, LOS_30DAYS) #greater than 30 days
#count(Cdiff_patients, DISCHARGE_MONTH)
#count(Cdiff_patients, DISCHARGE_STATUS)
#visualization
ggplot(Cdiff_patients, aes(x = SEX)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("C. difficile infection by sex") +
  theme_bw()

ggplot(Cdiff_patients, aes(x = age.simplified)) +
  geom_bar() +
  labs(x = "Age", y = "Count") +
  ggtitle("C. difficile infection by age") +
  theme_bw()

ggplot(Cdiff_patients, aes(x = LOS)) +
  geom_bar() +
  labs(x = "Length of stay (up to 14 days)", y = "Count") +
  ggtitle("C. difficile infection by length of stay") +
  theme_bw()

#greater than 15 binned in data

ggplot(Cdiff_patients, aes(x = LOS_30DAYS)) +
  geom_bar() +
  labs(x = "Length of stay greater than 30 days", y = "Count") +
  ggtitle("C. difficile infection by length of stay greater than 30 days") +
  theme_bw()

ggplot(Cdiff_patients, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  ggtitle("C. difficile infection by discharge month") +
  theme_bw()

#MRSA
mrsa_patients <- filter(alldiag_2020, DX %in% c("A4102", "A4902", "B9562", "J15212"))
#A4102   Sepsis due to Methicillin resistant Staphylococcus aureus
#A4902   Methicillin resistant Staphylococcus aureus infection, unspecified site
#B9562   Methicillin resistant Staphylococcus aureus infection as the cause of diseases classified elsewhere
#J15212  Pneumonia due to Methicillin resistant Staphylococcus aureus

# no MRSA cases in dataset
#enterococcus
entero_patients <- filter(alldiag_2020, DX %in% c("A4181", "B952"))
#A4181   Sepsis due to Enterococcus
#B952    Enterococcus as the cause of diseases classified elsewhere
#plot number of cases by demographics
#re-level age
entero_patients <- entero_patients |>
  mutate(age.simplified = cut(AGE, 
                              breaks = c(0, 5, 10, 15, 20, seq(30, 100, by = 10)), 
                              right = FALSE, 
                              labels = c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))

#count(entero_patients, age.simplified)
#count(entero_patients, SEX)
#count(entero_patients, LOS) #length of stay
#count(entero_patients, LOS_30DAYS) #greater than 30 days
#count(entero_patients, DISCHARGE_MONTH)
#count(entero_patients, DISCHARGE_STATUS)
#visualization
ggplot(entero_patients, aes(x = SEX)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("Enterococcus infections by sex") +
  theme_bw()

ggplot(entero_patients, aes(x = age.simplified)) +
  geom_bar() +
  labs(x = "Age", y = "Count") +
  ggtitle("Enterococcus infections by age") +
  theme_bw()

ggplot(entero_patients, aes(x = LOS)) +
  geom_bar() +
  labs(x = "Length of stay (up to 14 days)", y = "Count") +
  ggtitle("Enterococcus infections by length of stay") +
  theme_bw()

#greater than 15 binned in data

ggplot(entero_patients, aes(x = LOS_30DAYS)) +
  geom_bar() +
  labs(x = "Length of stay greater than 30 days", y = "Count") +
  ggtitle("Enterococcus infections by length of stay greater than 30 days") +
  theme_bw()

ggplot(entero_patients, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  ggtitle("Enterococcus infections by discharge month") +
  theme_bw()

#infections from catheter
infcatheter_patients <- filter(alldiag_2020, DX %in% c("T80211A", "T80211D ", "T80211S", "T80212A", "T80212D", "T80212S", "T80218A", "T80218D", "T80218S", "T80219A", "T80219D", "T80219S"))
#T80211A Bloodstream infection due to central venous catheter, initial encounter
#T80211D Bloodstream infection due to central venous catheter, subsequent encounter
#T80211S Bloodstream infection due to central venous catheter, sequela
#T80212A Local infection due to central venous catheter, initial encounter
#T80212D Local infection due to central venous catheter, subsequent encounter
#T80212S Local infection due to central venous catheter, sequela
#T80218A Other infection due to central venous catheter, initial encounter
#T80218D Other infection due to central venous catheter, subsequent encounter
#T80218S Other infection due to central venous catheter, sequela
#T80219A Unspecified infection due to central venous catheter, initial encounter
#T80219D Unspecified infection due to central venous catheter, subsequent encounter
#T80219S Unspecified infection due to central venous catheter, sequela

# no catheter-associated infections in dataset
# ventilator-associated pneumonia
ventpenumo_patients <- filter(alldiag_2020, DX =="J95851")
#J95851  Ventilator associated pneumonia

# no ventilator-associated pneumonia cases in dataset
# surgical site infections
surginf_patients <- filter(alldiag_2020, DX %in% c("T8141XA", "T8141XD", "T8141XS", "T8142XA", "T8142XD", "T8142XS", "T8143XA", "T8143XD", "T8143XS", "O8600", "O8601", "O8602", "O8603", "O8604", "08609"))

#T8141XA Infection following a procedure, superficial incisional surgical site, initial encounter
#T8141XD Infection following a procedure, superficial incisional surgical site, subsequent encounter
#T8141XS Infection following a procedure, superficial incisional surgical site, sequela
#T8142XA Infection following a procedure, deep incisional surgical site, initial encounter
#T8142XD Infection following a procedure, deep incisional surgical site, subsequent encounter
#T8142XS Infection following a procedure, deep incisional surgical site, sequela
#T8143XA Infection following a procedure, organ and space surgical site, initial encounter
#T8143XD Infection following a procedure, organ and space surgical site, subsequent encounter
#T8143XS Infection following a procedure, organ and space surgical site, sequela
#O8600   Infection of obstetric surgical wound, unspecified
#O8601   Infection of obstetric surgical wound, superficial incisional site
#O8602   Infection of obstetric surgical wound, deep incisional site
#O8603   Infection of obstetric surgical wound, organ and space site
#O8604   Sepsis following an obstetrical procedure
#O8609   Infection of obstetric surgical wound, other surgical site

# no surgical site infection cases in dataset
  • There were no cases of infections associated with surgeries, infections from catheter use, methicillin-resistant Staphylococcus aureus infections, or ventilator-associated pneumonia in the dataset.
#plot month data together 
covid_month <- ggplot(covid_patients, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  theme_bw()

#adjust margins for cowplot
covid_month <- covid_month + theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

bactpneumo_month <- ggplot(bactpneumo_patients, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  theme_bw()

#adjust margins for cowplot
bactpneumo_month <- bactpneumo_month + theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

cdiff_month <- ggplot(Cdiff_patients, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  theme_bw()

#adjust margins for cowplot
cdiff_month <- cdiff_month + theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

entero_month <- ggplot(entero_patients, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  theme_bw()

entero_month <- entero_month + theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

plot_grid(covid_month, bactpneumo_month, cdiff_month, entero_month, labels = c('COVID-19', 'Bacterial pneumonia', 'C. difficile', 'Enterococcus'), label_size = 12, label_x = -0.05, label_y = 1)

Next, I analyzed for association between discharge month and infection.

# determine whether discharge month is associated with COVID 

# filter DISCHARGE_MONTH and DX columns
dx_month <- alldiag_2020 |>
  select(c(DISCHARGE_MONTH, DX))

# Remove NA in DX
dx_month_narm <- dx_month[!is.na(dx_month$DX), ]

# Remove NA in Discharge Month
dx_month_narm <- dx_month_narm[!is.na(dx_month_narm$DISCHARGE_MONTH), ]

# create new column with COVID infection as factor
covid_binary_month <- dx_month_narm |>
  mutate(COVID = ifelse(DX == "U071", 1, 0))

# month as factor
covid_binary_month$DISCHARGE_MONTH <- factor(
  covid_binary_month$DISCHARGE_MONTH, 
  levels = 1:12, 
  labels = month.name
)

# relevel to set reference month to April
covid_binary_month$DISCHARGE_MONTH <- fct_relevel(covid_binary_month$DISCHARGE_MONTH, "April")

# Fit the logistic model
covid.month.fit <- glm(COVID ~ DISCHARGE_MONTH, data = covid_binary_month, family = binomial)

summary(covid.month.fit)

Call:
glm(formula = COVID ~ DISCHARGE_MONTH, family = binomial, data = covid_binary_month)

Coefficients:
                          Estimate Std. Error  z value Pr(>|z|)    
(Intercept)               -4.33340    0.02855 -151.786  < 2e-16 ***
DISCHARGE_MONTHJanuary    -7.50209    1.00041   -7.499 6.43e-14 ***
DISCHARGE_MONTHFebruary  -16.23267   49.74714   -0.326   0.7442    
DISCHARGE_MONTHMarch      -5.76748    0.44813  -12.870  < 2e-16 ***
DISCHARGE_MONTHMay        -0.60721    0.04544  -13.363  < 2e-16 ***
DISCHARGE_MONTHJune       -1.31565    0.05666  -23.221  < 2e-16 ***
DISCHARGE_MONTHJuly       -1.35050    0.05550  -24.334  < 2e-16 ***
DISCHARGE_MONTHAugust     -1.54094    0.05988  -25.735  < 2e-16 ***
DISCHARGE_MONTHSeptember  -1.90079    0.06933  -27.416  < 2e-16 ***
DISCHARGE_MONTHOctober    -1.36564    0.05500  -24.829  < 2e-16 ***
DISCHARGE_MONTHNovember   -0.58515    0.04379  -13.364  < 2e-16 ***
DISCHARGE_MONTHDecember   -0.09776    0.03829   -2.553   0.0107 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 83171  on 1498051  degrees of freedom
Residual deviance: 76909  on 1498040  degrees of freedom
AIC: 76933

Number of Fisher Scoring iterations: 19
  • Relative to April (the first surge in COVID-19 cases), COVID-19 infection was negatively associated with May-December.
# determine whether discharge month is associated with bacterial pneumonia  

bactpneumo_code <- c("J13", "J14", "J150", "J151", "J1520", "J15211", "J1529", "J153", "J154", "J155", "J1561", "J1569", "J157", "J158", "J159", "J160")

# create new column with COVID infection as factor
bactpneumo_binary_month <- alldiag_2020 |>
  mutate(bactpneumo = ifelse(DX %in% bactpneumo_code, 1, 0))

# Fit the linear model
bactpneumo.month.fit <- glm(bactpneumo ~ factor(DISCHARGE_MONTH), data = bactpneumo_binary_month, family = binomial)

summary(bactpneumo.month.fit)

Call:
glm(formula = bactpneumo ~ factor(DISCHARGE_MONTH), family = binomial, 
    data = bactpneumo_binary_month)

Coefficients:
                          Estimate Std. Error  z value Pr(>|z|)    
(Intercept)               -7.61100    0.07334 -103.774  < 2e-16 ***
factor(DISCHARGE_MONTH)2  -0.29591    0.11460   -2.582 0.009821 ** 
factor(DISCHARGE_MONTH)3   0.12533    0.10358    1.210 0.226310    
factor(DISCHARGE_MONTH)4   0.27074    0.10598    2.555 0.010626 *  
factor(DISCHARGE_MONTH)5   0.08101    0.10749    0.754 0.451060    
factor(DISCHARGE_MONTH)6  -0.36646    0.12065   -3.037 0.002387 ** 
factor(DISCHARGE_MONTH)7  -0.51773    0.12323   -4.201 2.65e-05 ***
factor(DISCHARGE_MONTH)8  -0.41547    0.11996   -3.463 0.000534 ***
factor(DISCHARGE_MONTH)9  -0.70243    0.13258   -5.298 1.17e-07 ***
factor(DISCHARGE_MONTH)10 -0.39638    0.11741   -3.376 0.000735 ***
factor(DISCHARGE_MONTH)11 -0.15140    0.11238   -1.347 0.177909    
factor(DISCHARGE_MONTH)12  0.05973    0.10518    0.568 0.570118    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 29344  on 3980819  degrees of freedom
Residual deviance: 29215  on 3980808  degrees of freedom
AIC: 29239

Number of Fisher Scoring iterations: 11
  • Bacterial pneumonia was negatively associated with discharge months Jan, Feb, and Jun~Oct and positively associated with April.
# determine whether discharge month is associated with C. difficile infection 

# create new column with C. diff infection as factor
cdiff_binary_month <- alldiag_2020 |>
  mutate(cdiff = ifelse(DX == "A047", 1, 0))

# Fit the linear model
cdiff.month.fit <- glm(cdiff ~ factor(DISCHARGE_MONTH), data = cdiff_binary_month, family = binomial)

summary(cdiff.month.fit)

Call:
glm(formula = cdiff ~ factor(DISCHARGE_MONTH), family = binomial, 
    data = cdiff_binary_month)

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -7.161890   0.096711 -74.054   <2e-16 ***
factor(DISCHARGE_MONTH)2  -0.182908   0.146820  -1.246   0.2128    
factor(DISCHARGE_MONTH)3  -0.015104   0.141824  -0.106   0.9152    
factor(DISCHARGE_MONTH)4  -0.150278   0.158077  -0.951   0.3418    
factor(DISCHARGE_MONTH)5  -0.228811   0.153778  -1.488   0.1368    
factor(DISCHARGE_MONTH)6  -0.097025   0.145829  -0.665   0.5058    
factor(DISCHARGE_MONTH)7  -0.287661   0.150062  -1.917   0.0552 .  
factor(DISCHARGE_MONTH)8  -0.315877   0.151853  -2.080   0.0375 *  
factor(DISCHARGE_MONTH)9  -0.003822   0.139506  -0.027   0.9781    
factor(DISCHARGE_MONTH)10 -0.263541   0.147330  -1.789   0.0736 .  
factor(DISCHARGE_MONTH)11 -0.150290   0.145828  -1.031   0.3027    
factor(DISCHARGE_MONTH)12 -0.259636   0.148387  -1.750   0.0802 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16506  on 1498051  degrees of freedom
Residual deviance: 16494  on 1498040  degrees of freedom
  (2482768 observations deleted due to missingness)
AIC: 16518

Number of Fisher Scoring iterations: 10
  • Only January and August was negatively associated with C. difficile infection indicating that there is no seasonality in cases.
# determine whether discharge month is associated with Enterococcus infection 
entero_code <- c("A4181", "B952")

# create new column with Enterococcus infection as factor
entero_binary_month <- alldiag_2020 |>
  mutate(entero = ifelse(DX %in% entero_code, 1, 0))

# Fit the linear model
entero.month.fit <- glm(entero ~ factor(DISCHARGE_MONTH), data = entero_binary_month, family = binomial)

summary(entero.month.fit)

Call:
glm(formula = entero ~ factor(DISCHARGE_MONTH), family = binomial, 
    data = entero_binary_month)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -8.88586    0.13868 -64.072   <2e-16 ***
factor(DISCHARGE_MONTH)2  -0.16754    0.20887  -0.802   0.4225    
factor(DISCHARGE_MONTH)3  -0.04716    0.20485  -0.230   0.8179    
factor(DISCHARGE_MONTH)4  -0.07022    0.22057  -0.318   0.7502    
factor(DISCHARGE_MONTH)5   0.19973    0.19709   1.013   0.3109    
factor(DISCHARGE_MONTH)6  -0.11961    0.21184  -0.565   0.5723    
factor(DISCHARGE_MONTH)7   0.15738    0.19260   0.817   0.4139    
factor(DISCHARGE_MONTH)8  -0.41638    0.22692  -1.835   0.0665 .  
factor(DISCHARGE_MONTH)9   0.01575    0.20128   0.078   0.9376    
factor(DISCHARGE_MONTH)10 -0.02964    0.20017  -0.148   0.8823    
factor(DISCHARGE_MONTH)11  0.02457    0.20242   0.121   0.9034    
factor(DISCHARGE_MONTH)12  0.13403    0.19520   0.687   0.4923    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10733  on 3980819  degrees of freedom
Residual deviance: 10721  on 3980808  degrees of freedom
AIC: 10745

Number of Fisher Scoring iterations: 12
  • Similar to C. difficle, Enterococcus infection was only negatively associated with January and August indicating no seasonality.

Bacterial pneumonia follows a similar seasonality as COVID-19 as expected, but C. difficile and Enterococcus infections appears to be pretty consistent throughout the year.

Next, I looked at the age distribution of each infection.

#plot age data together 
covid_age <- ggplot(covid_patients, aes(x = age.simplified)) +
  geom_bar() +
  labs(x = "Age", y = "Count") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#adjust margins for cowplot
covid_age <- covid_age + theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

bactpneumo_age <- ggplot(bactpneumo_patients, aes(x = age.simplified)) +
  geom_bar() +
  labs(x = "Age", y = "Count") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#adjust margins for cowplot
bactpneumo_age <- bactpneumo_age + theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

cdiff_age <- ggplot(Cdiff_patients, aes(x = age.simplified)) +
  geom_bar() +
  labs(x = "Age", y = "Count") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#adjust margins for cowplot
cdiff_age <- cdiff_age + theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

entero_age <- ggplot(entero_patients, aes(x = age.simplified)) +
  geom_bar() +
  labs(x = "Age", y = "Count") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

entero_age <- entero_age + theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

plot_grid(covid_age, bactpneumo_age, cdiff_age, entero_age, labels = c('COVID-19', 'Bacterial pneumonia', 'C. difficile', 'Enterococcus'), label_size = 12, label_x = -0.05, label_y = 1)

# Determine whether Age is associated with COVID-19

# filter DISCHARGE_MONTH and DX columns
dx_age <- alldiag_desc |>
  select(c(AGE, DX))

# Remove NA in DX
dx_age_narm <- dx_age[!is.na(dx_age$DX), ]

# Remove NA in Discharge Month
dx_age_narm <- dx_age_narm[!is.na(dx_age_narm$AGE), ]

# create new column with COVID infection as factor
covid_binary_age <- dx_age_narm |>
  mutate(COVID = ifelse(DX == "U071", 1, 0))

# Fit the logistic model
covid.age.fit <- glm(COVID ~ AGE, data = covid_binary_age, family = binomial)

summary(covid.age.fit)

Call:
glm(formula = COVID ~ AGE, family = binomial, data = covid_binary_age)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.9510096  0.0380230 -156.51   <2e-16 ***
AGE          0.0086774  0.0005929   14.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 83128  on 1496830  degrees of freedom
Residual deviance: 82901  on 1496829  degrees of freedom
AIC: 82905

Number of Fisher Scoring iterations: 8
# Bacterial pneumonia 
# create new column with COVID infection as factor
bactpneumo_binary_age <- dx_age_narm |>
  mutate(bactpneumo = ifelse(DX %in% bactpneumo_code, 1, 0))

# Fit the logistic model
bactpneumo.age.fit <- glm(bactpneumo ~ AGE, data = bactpneumo_binary_age, family = binomial)

summary(bactpneumo.age.fit)

Call:
glm(formula = bactpneumo ~ AGE, family = binomial, data = bactpneumo_binary_age)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.241325   0.073380 -98.682  < 2e-16 ***
AGE          0.007636   0.001150   6.637 3.19e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26072  on 1496830  degrees of freedom
Residual deviance: 26025  on 1496829  degrees of freedom
AIC: 26029

Number of Fisher Scoring iterations: 9
# C. difficile
# create new column with COVID infection as factor
cdiff_binary_age <- dx_age_narm |>
  mutate(cdiff = ifelse(DX == "A047", 1, 0))

# Fit the logistic model
cdiff.age.fit <- glm(cdiff ~ AGE, data = cdiff_binary_age, family = binomial)

summary(cdiff.age.fit)

Call:
glm(formula = cdiff ~ AGE, family = binomial, data = cdiff_binary_age)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.982675   0.100811 -79.185  < 2e-16 ***
AGE          0.011227   0.001551   7.241 4.46e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16505  on 1496830  degrees of freedom
Residual deviance: 16448  on 1496829  degrees of freedom
AIC: 16452

Number of Fisher Scoring iterations: 10
# Enterococcus
# create new column with COVID infection as factor
entero_binary_age <- dx_age_narm |>
  mutate(entero = ifelse(DX %in% entero_code, 1, 0))

# Fit the logistic model
entero.age.fit <- glm(entero ~ AGE, data = entero_binary_age, family = binomial)

summary(entero.age.fit)

Call:
glm(formula = entero ~ AGE, family = binomial, data = entero_binary_age)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.563859   0.135560 -63.174  < 2e-16 ***
AGE          0.010844   0.002089   5.191 2.09e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9673.0  on 1496830  degrees of freedom
Residual deviance: 9643.9  on 1496829  degrees of freedom
AIC: 9647.9

Number of Fisher Scoring iterations: 11
  • All 4 infections predominantly affects older individuals (highest 60-69).
# proportions by discharge month and age

ggplot(covid_patients, aes(x = DISCHARGE_MONTH, fill = age.simplified)) +
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Spectral") +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Discharge Month", y = "Proportion", fill = "Age") +
  ggtitle("COVID-19 cases") +
  theme_bw()

ggplot(bactpneumo_patients, aes(x = DISCHARGE_MONTH, fill = age.simplified)) +
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Spectral") +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Discharge Month", y = "Proportion", fill = "Age") +
  ggtitle("Bacterial pneumonia cases") +
  theme_bw()

ggplot(Cdiff_patients, aes(x = DISCHARGE_MONTH, fill = age.simplified)) +
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Spectral") +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Discharge Month", y = "Proportion", fill = "Age") +
  ggtitle("C. difficile infections") +
  theme_bw()

ggplot(entero_patients, aes(x = DISCHARGE_MONTH, fill = age.simplified)) +
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Spectral") +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Discharge Month", y = "Proportion", fill = "Age") +
  ggtitle("Enterococcus infections") +
  theme_bw()

# expand on bacterial pneumonia for kids under 10
bactpneumo_under10 <- bactpneumo_patients |>
  filter(AGE < 10)

bactpneumo_under10_plot <- ggplot(bactpneumo_under10, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  xlab("Discharge Month") +
  ggtitle("Bacterial pneumonia cases under 10 years old") +
  theme_bw()

bactpneumo_under10_plot

# determine whether discharge month is associated with bacterial pneumonia under 10
alldiag_2020_under10 <- alldiag_2020 |>
  filter(AGE < 10)

bactpneumo_under10_binary_month <- alldiag_2020_under10 |>
  mutate(bactpneumo = ifelse(DX %in% bactpneumo_code, 1, 0))

# Fit the linear model
bactpneumo.under10.month.fit <- glm(bactpneumo ~ factor(DISCHARGE_MONTH), data = bactpneumo_under10_binary_month, family = binomial)

summary(bactpneumo.under10.month.fit)

Call:
glm(formula = bactpneumo ~ factor(DISCHARGE_MONTH), family = binomial, 
    data = bactpneumo_under10_binary_month)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -7.3202     0.1716 -42.669  < 2e-16 ***
factor(DISCHARGE_MONTH)2   -0.6339     0.2971  -2.134 0.032873 *  
factor(DISCHARGE_MONTH)3   -0.6580     0.3032  -2.170 0.030022 *  
factor(DISCHARGE_MONTH)4   -1.4468     0.4429  -3.267 0.001087 ** 
factor(DISCHARGE_MONTH)5  -17.2459   642.2556  -0.027 0.978578    
factor(DISCHARGE_MONTH)6   -3.3316     1.0146  -3.284 0.001025 ** 
factor(DISCHARGE_MONTH)7   -1.7686     0.4790  -3.692 0.000222 ***
factor(DISCHARGE_MONTH)8   -3.4035     1.0146  -3.354 0.000795 ***
factor(DISCHARGE_MONTH)9   -2.2344     0.6023  -3.710 0.000208 ***
factor(DISCHARGE_MONTH)10 -17.2459   629.0164  -0.027 0.978127    
factor(DISCHARGE_MONTH)11  -1.9128     0.5286  -3.618 0.000296 ***
factor(DISCHARGE_MONTH)12  -2.1338     0.6023  -3.543 0.000396 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1740.3  on 523499  degrees of freedom
Residual deviance: 1630.3  on 523488  degrees of freedom
AIC: 1654.3

Number of Fisher Scoring iterations: 23
  • The proportion of kids under 10 who were diagnosed for bacterial pneumonia decreased after the first few months.

I next looked at the sex distribution of infection.

# plot infection by sex together

covid_sex <- ggplot(covid_patients, aes(x = SEX)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("COVID-19") +
  theme_bw()

bactpneumo_sex <- ggplot(bactpneumo_patients, aes(x = SEX)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("Bacterial penumonia") +
  theme_bw()

cdiff_sex <- ggplot(Cdiff_patients, aes(x = SEX)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("C. difficile") +
  theme_bw()

entero_sex <- ggplot(entero_patients, aes(x = SEX)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("Enterococcus") +
  theme_bw()


plot_grid(covid_sex, bactpneumo_sex, cdiff_sex, entero_sex, label_size = 12, label_x = -0.05, label_y = 1)

# determine whether sex is assocated with COVID-19 diagnosis

# filter DISCHARGE_MONTH and DX columns
dx_sex <- alldiag_2020 |>
  select(c(SEX, DX))

# Remove NA in DX
dx_sex_narm <- dx_sex[!is.na(dx_sex$DX), ]

# Remove NA in Sex
dx_sex_narm <- dx_sex_narm[!is.na(dx_sex_narm$SEX), ]

# create new column with COVID infection as factor
covid_binary_sex <- dx_sex_narm |>
  mutate(COVID = ifelse(DX == "U071", 1, 0))

# Fit the logistic model
covid.sex.fit <- glm(COVID ~ factor(SEX), data = covid_binary_sex, family = binomial)

summary(covid.sex.fit)

Call:
glm(formula = COVID ~ factor(SEX), family = binomial, data = covid_binary_sex)

Coefficients:
                  Estimate Std. Error  z value Pr(>|z|)    
(Intercept)       -5.35494    0.01727 -310.132  < 2e-16 ***
factor(SEX)Female -0.17561    0.02498   -7.029 2.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 83138  on 1496739  degrees of freedom
Residual deviance: 83089  on 1496738  degrees of freedom
AIC: 83093

Number of Fisher Scoring iterations: 8
covid_sex_odds <- exp(coef(covid.sex.fit))
covid_sex_odds
      (Intercept) factor(SEX)Female 
      0.004724765       0.838946761 
  • Female sex is significantly negatively associated (0.839 fold) with COVID-19 compared to males.
# determine whether sex is assocated with bacterial pneumonia diagnosis

# create new column with COVID infection as factor
bactpneumo_binary_sex <- dx_sex_narm |>
  mutate(bactpneumo = ifelse(DX %in% bactpneumo_code, 1, 0))

# Fit the logistic model
bactpneumo.sex.fit <- glm(bactpneumo ~ factor(SEX), data = bactpneumo_binary_sex, family = binomial)

summary(bactpneumo.sex.fit)

Call:
glm(formula = bactpneumo ~ factor(SEX), family = binomial, data = bactpneumo_binary_sex)

Coefficients:
                  Estimate Std. Error  z value Pr(>|z|)    
(Intercept)       -6.60264    0.03211 -205.607   <2e-16 ***
factor(SEX)Female -0.41114    0.04958   -8.292   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26071  on 1496739  degrees of freedom
Residual deviance: 26002  on 1496738  degrees of freedom
AIC: 26006

Number of Fisher Scoring iterations: 9
bactpneumo_sex_odds <- exp(coef(bactpneumo.sex.fit))
bactpneumo_sex_odds
      (Intercept) factor(SEX)Female 
      0.001356786       0.662892382 
  • Female sex is also significantly negatively associated (0.662 fold) with bacterial pneumonia diagnosis compared to males.
# determine whether sex is assocated with C. difficile infection 

# create new column with COVID infection as factor
cdiff_binary_sex <- dx_sex_narm |>
  mutate(cdiff = ifelse(DX == "A047", 1, 0))

# Fit the logistic model
cdiff.sex.fit <- glm(cdiff ~ factor(SEX), data = cdiff_binary_sex, family = binomial)

summary(cdiff.sex.fit)

Call:
glm(formula = cdiff ~ factor(SEX), family = binomial, data = cdiff_binary_sex)

Coefficients:
                  Estimate Std. Error  z value Pr(>|z|)    
(Intercept)       -7.33963    0.04639 -158.220   <2e-16 ***
factor(SEX)Female  0.04032    0.06365    0.634    0.526    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16504  on 1496739  degrees of freedom
Residual deviance: 16504  on 1496738  degrees of freedom
AIC: 16508

Number of Fisher Scoring iterations: 10
  • C. difficile infection is not associated with a particular sex.
# determine whether sex is assocated with Enterococcus infection 

# create new column with COVID infection as factor
entero_binary_sex <- dx_sex_narm |>
  mutate(entero = ifelse(DX %in% entero_code, 1, 0))

# Fit the logistic model
entero.sex.fit <- glm(entero ~ factor(SEX), data = entero_binary_sex, family = binomial)

summary(entero.sex.fit)

Call:
glm(formula = entero ~ factor(SEX), family = binomial, data = entero_binary_sex)

Coefficients:
                  Estimate Std. Error  z value Pr(>|z|)    
(Intercept)       -7.79833    0.05833 -133.692  < 2e-16 ***
factor(SEX)Female -0.25511    0.08622   -2.959  0.00309 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9672.9  on 1496739  degrees of freedom
Residual deviance: 9664.1  on 1496738  degrees of freedom
AIC: 9668.1

Number of Fisher Scoring iterations: 10
entero_sex_odds <- exp(coef(entero.sex.fit))
entero_sex_odds
      (Intercept) factor(SEX)Female 
     0.0004104202      0.7748308447 
  • Female sex is negatively associated with Enterococcus infection (0.775 fold).

Next, I looked at the association between a COVID-19 diagnosis and hospital-associated infection.

# determine whether COVID diagnosis is associated with other hospital-associated infections

alldiag_covid_binary <- var_2020 |>
  mutate(COVID = ifelse(rowSums(select(var_2020, 21:50) == "U071", na.rm = TRUE) > 0, 1, 0))
# determine whether COVID is associated with bacterial pneumonia

bactpneumo_code <- c("J13", "J14", "J150", "J151", "J1520", "J15211", "J1529", "J153", "J154", "J155", "J1561", "J1569", "J157", "J158", "J159", "J160")

alldiag_bactpneumo_binary <- alldiag_covid_binary |>
  mutate(bactpneumo = ifelse(rowSums(select(alldiag_covid_binary, 21:50) == bactpneumo_code, na.rm = TRUE) > 0, 1, 0))

alldiag_bactpneumo_factor <- alldiag_bactpneumo_binary %>%
  mutate(COVID = factor(COVID, levels = c(0, 1), labels = c("Negative", "Positive")),
         bactpneumo = factor(bactpneumo, levels = c(0,1), labels = c("Negative", "Positive")))

# Create the plot with labeled factors
ggplot(alldiag_bactpneumo_factor, aes(x = bactpneumo, fill = COVID)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("Negative" = "#aaaaaa", "Positive" = "#b20000")) +
  theme_bw() +
  labs(x = "Bacterial Pneumonia", 
       y = "Proportion", 
       fill = "COVID Status")

# Fit the linear model
covid.bactpneumo.fit <- glm(bactpneumo ~ COVID, data = alldiag_bactpneumo_binary, family = binomial)

summary(covid.bactpneumo.fit)

Call:
glm(formula = bactpneumo ~ COVID, family = binomial, data = alldiag_bactpneumo_binary)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -7.2344     0.1049  -68.99  < 2e-16 ***
COVID         1.6850     0.2262    7.45 9.33e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1865.7  on 132693  degrees of freedom
Residual deviance: 1826.3  on 132692  degrees of freedom
AIC: 1830.3

Number of Fisher Scoring iterations: 10
# adjust for sex
covid.bactpneumo.sex.fit <- glm(bactpneumo ~ SEX + COVID, data = alldiag_bactpneumo_binary, family = binomial)

summary(covid.bactpneumo.sex.fit)

Call:
glm(formula = bactpneumo ~ SEX + COVID, family = binomial, data = alldiag_bactpneumo_binary)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -6.9436     0.1291 -53.777  < 2e-16 ***
SEXFemale    -0.6187     0.1917  -3.227  0.00125 ** 
COVID         1.6437     0.2265   7.257 3.97e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1865.5  on 132588  degrees of freedom
Residual deviance: 1815.3  on 132586  degrees of freedom
  (105 observations deleted due to missingness)
AIC: 1821.3

Number of Fisher Scoring iterations: 10
# adjust for age
covid.bactpneumo.age.fit <- glm(bactpneumo ~ AGE + COVID, data = alldiag_bactpneumo_binary, family = binomial)

summary(covid.bactpneumo.age.fit)

Call:
glm(formula = bactpneumo ~ AGE + COVID, family = binomial, data = alldiag_bactpneumo_binary)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.052699   0.262426 -30.686  < 2e-16 ***
AGE          0.015408   0.004154   3.709 0.000208 ***
COVID        1.531408   0.228352   6.706    2e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1865.6  on 132613  degrees of freedom
Residual deviance: 1811.0  on 132611  degrees of freedom
  (80 observations deleted due to missingness)
AIC: 1817

Number of Fisher Scoring iterations: 10
  • A positive coefficient (1.69) and a p-value less than 0.001 indicates that there is a very strong positive association between bacterial pneumonia and COVID diagnosis, regardless of sex or age.
# determine whether COVID is associated with C. difficile colitis

alldiag_cdiff_binary <- alldiag_covid_binary |>
  mutate(Cdiff = ifelse(rowSums(select(alldiag_covid_binary, 21:50) == "A047", na.rm = TRUE) > 0, 1, 0))

alldiag_cdiff_factor <- alldiag_cdiff_binary %>%
  mutate(COVID = factor(COVID, levels = c(0, 1), labels = c("Negative", "Positive")),
         Cdiff = factor(Cdiff, levels = c(0,1), labels = c("Negative", "Positive")))

# Create the plot with labeled factors
ggplot(alldiag_cdiff_factor, aes(x = Cdiff, fill = COVID)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("Negative" = "#aaaaaa", "Positive" = "#b20000")) +
  theme_bw() +
  labs(x = "C. difficile infection", 
       y = "Proportion", 
       fill = "COVID Status")

# Fit the linear model
covid.cdiff.fit <- glm(Cdiff ~ COVID, data = alldiag_cdiff_binary, family = binomial)

summary(covid.cdiff.fit)

Call:
glm(formula = Cdiff ~ COVID, family = binomial, data = alldiag_cdiff_binary)

Coefficients:
            Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -4.89046    0.03270 -149.548   <2e-16 ***
COVID        0.03812    0.14568    0.262    0.794    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11690  on 132693  degrees of freedom
Residual deviance: 11690  on 132692  degrees of freedom
AIC: 11694

Number of Fisher Scoring iterations: 7
# adjust for sex
covid.cdiff.sex.fit <- glm(Cdiff ~ SEX + COVID, data = alldiag_cdiff_binary, family = binomial)

summary(covid.cdiff.sex.fit)

Call:
glm(formula = Cdiff ~ SEX + COVID, family = binomial, data = alldiag_cdiff_binary)

Coefficients:
            Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -4.86594    0.04729 -102.888   <2e-16 ***
SEXFemale   -0.04387    0.06389   -0.687    0.492    
COVID        0.03463    0.14575    0.238    0.812    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11689  on 132588  degrees of freedom
Residual deviance: 11688  on 132586  degrees of freedom
  (105 observations deleted due to missingness)
AIC: 11694

Number of Fisher Scoring iterations: 7
# adjust for age
covid.cdiff.age.fit <- glm(Cdiff ~ AGE + COVID, data = alldiag_cdiff_binary, family = binomial)

summary(covid.cdiff.age.fit)

Call:
glm(formula = Cdiff ~ AGE + COVID, family = binomial, data = alldiag_cdiff_binary)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.118076   0.096028 -63.711   <2e-16 ***
AGE          0.022290   0.001472  15.145   <2e-16 ***
COVID       -0.162522   0.146031  -1.113    0.266    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11689  on 132613  degrees of freedom
Residual deviance: 11419  on 132611  degrees of freedom
  (80 observations deleted due to missingness)
AIC: 11425

Number of Fisher Scoring iterations: 8
  • There is no association between C. difficile colitis and COVID-19 diagnosis.
# determine whether COVID diagnosis is associated with Enterococcus infection

alldiag_entero_binary <- alldiag_covid_binary |>
  mutate(Entero = ifelse(rowSums(select(alldiag_covid_binary, 21:50) == entero_code, na.rm = TRUE) > 0, 1, 0))

alldiag_entero_factor <- alldiag_entero_binary %>%
  mutate(COVID = factor(COVID, levels = c(0, 1), labels = c("Negative", "Positive")),
         Entero = factor(Entero, levels = c(0,1), labels = c("Negative", "Positive")))

# Create the plot with labeled factors
ggplot(alldiag_entero_factor, aes(x = Entero, fill = COVID)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("Negative" = "#aaaaaa", "Positive" = "#b20000")) +
  theme_bw() +
  labs(x = "Enterococcus infection", 
       y = "Proportion", 
       fill = "COVID Status")

# Fit the linear model
covid.entero.fit <- glm(Entero ~ COVID, data = alldiag_entero_binary, family = binomial)

summary(covid.entero.fit)

Call:
glm(formula = Entero ~ COVID, family = binomial, data = alldiag_entero_binary)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.21056    0.06293 -98.686   <2e-16 ***
COVID        0.21349    0.25810   0.827    0.408    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3873.6  on 132693  degrees of freedom
Residual deviance: 3873.0  on 132692  degrees of freedom
AIC: 3877

Number of Fisher Scoring iterations: 9
# adjust for sex
covid.entero.sex.fit <- glm(Entero ~ SEX + COVID, data = alldiag_entero_binary, family = binomial)

summary(covid.entero.sex.fit)

Call:
glm(formula = Entero ~ SEX + COVID, family = binomial, data = alldiag_entero_binary)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.01866    0.08378 -71.837  < 2e-16 ***
SEXFemale   -0.38381    0.12284  -3.124  0.00178 ** 
COVID        0.18710    0.25824   0.725  0.46873    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3873.2  on 132588  degrees of freedom
Residual deviance: 3862.7  on 132586  degrees of freedom
  (105 observations deleted due to missingness)
AIC: 3868.7

Number of Fisher Scoring iterations: 9
  • There is no association between Enterococus infection and COVID-19 diagnosis.
  • Age is positively associated with COVID-19 diagnosis.

Emergency department data set

I also looked at the emergency department data set. Note, I did not include this in the results section as majority of patients (>80%) did not stay more than a day.

# Read in NHCS 2020 emergency department R Dataset
url2 <- "https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHCS/2020/R/nhcs2020ed_r.rds"
nhcs2020ed <- read_rds(url2)
# emergency department dataset
#pull out variables list
var_2020_ed <- nhcs2020ed$variables

#select useful columns
var_2020_ed_select <- select(var_2020_ed, 1:37)

#variable names 
varnames_2020_ed <- colnames(var_2020_ed_select)

#pull out primary diagnosis (DX1)
diag1_2020_ed <- var_2020_ed_select$DX1

#determine the number of primary C. diff infections (ICD-10 diagnosis code: A047)
primaryCdiff_2020_ed <- sum(diag1_2020_ed == "A047", na.rm = TRUE)
# list primary diagnosis by count
var_2020_ed_select |>
  group_by(DX1) |>
  summarize(count = n()) |>
  arrange(desc(count))
# A tibble: 4,834 × 2
   DX1   count
   <chr> <int>
 1 <NA>  32745
 2 R078   7241
 3 U071   7184
 4 R079   6951
 5 J069   5635
 6 M255   5277
 7 R109   4273
 8 R101   4159
 9 R103   3990
10 N390   3955
# ℹ 4,824 more rows
# the code doesn't match exactly, for e.g. R078 has R0781, R0782, R0789 so can't accurately add descriptions. Left as is. The highest counts are chest pain, COVID-19, respiratory infection, joint pain, abdominal pain. 
var_2020_ed_select |>
  group_by(DISCHARGE_STATUS) |>
  summarize(count = n()) |>
  arrange(desc(count))
# A tibble: 9 × 2
  DISCHARGE_STATUS                         count
  <fct>                                    <int>
1 Routine to home                         314022
2 <NA>                                     27915
3 Other                                    22527
4 Left against medical advice               7554
5 Home health care                          6093
6 Transfer to short term facility           5279
7 Transfer to long term facility            2115
8 Dead                                      2063
9 Hospice care - home or medical facility   1185
ed_routine_to_home <- var_2020_ed_select |>
  filter(DISCHARGE_STATUS == "Routine to home") |>
  summarize(count = n()) |>
  pull(count)

ed_total_count <- var_2020_ed_select |>
  group_by(DISCHARGE_STATUS) |>
  summarize(count = n()) |>
  summarize(sum(count))

prop_ed_routine <- ed_routine_to_home / ed_total_count
prop_ed_routine
  sum(count)
1  0.8077674
  • 81% routine to home from emergency department.
#combine all diagnosis
alldiag_2020_ed <- var_2020_ed_select |>
  pivot_longer(
    cols = c(DX1:DX30), 
    values_to = "DX"
  )

#all COVID infections
all_covid_ed <- sum(alldiag_2020_ed =="U071", na.rm = TRUE)
all_covid_ed
[1] 10140
#all C.diff infection
allCdiff_ed <- sum(alldiag_2020_ed == "A047", na.rm = TRUE)
allCdiff_ed
[1] 492
  • 10,140 total COVID-19 diagnosis in emergency department.

  • 492 total C.diff infections in emergency department.

# Add descriptions to diagnosis (emergency room)
alldiag_desc_ed <- left_join(alldiag_2020_ed, icd_code)
Joining with `by = join_by(DX)`
# diagnosis arranged by number of cases (emergency room)
alldiag_desc_ed |>
  group_by(Description) |>
  summarize(count = n()) |>
  arrange(desc(count))
# A tibble: 3,452 × 2
   Description                                               count
   <chr>                                                     <int>
 1 <NA>                                                   10849914
 2 Essential (primary) hypertension                          55921
 3 Hyperlipidemia, unspecified                               22658
 4 Type 2 diabetes mellitus without complications            17780
 5 Anxiety disorder, unspecified                             16417
 6 Gastro-esophageal reflux disease without esophagitis      15842
 7 Chest pain, unspecified                                   14190
 8 Major depressive disorder, single episode, unspecified    13327
 9 Cough                                                     12254
10 Unspecified abdominal pain                                11393
# ℹ 3,442 more rows
#subset COVID-19 patients
covid_patients_ed <- filter(alldiag_2020_ed, DX =="U071")
covid_patients_ed <- covid_patients_ed |>
  filter(!is.na(AGE)) |>
  filter(!is.na(SEX))
#re-level age
covid_patients_ed <- covid_patients_ed |>
  mutate(age.simplified = cut(AGE, 
                              breaks = c(0, 5, 10, 15, 20, seq(30, 100, by = 10)), 
                              right = FALSE, 
                              labels = c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))

#count(covid_patients_ed, age.simplified)
#count(covid_patients_ed, AGE)
#count(covid_patients_ed, SEX)
#count(covid_patients_ed, DISCHARGE_MONTH)
#count(covid_patients_ed, DISCHARGE_STATUS)
#visualization
ggplot(covid_patients_ed, aes(x = SEX)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("COVID-19 cases by sex") +
  theme_bw()

ggplot(covid_patients_ed, aes(x = age.simplified)) +
  geom_bar() +
  labs(x = "Age", y = "Count") +
  ggtitle("COVID-19 cases by age") +
  theme_bw()

ggplot(covid_patients_ed, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  ggtitle("COVID-19 cases by discharge month") +
  theme_bw()

#subset bacterial pneumonia patients
bactpneumo_patients_ed <- filter(alldiag_2020_ed, DX %in% c("J13", "J14", "J150", "J151", "J1520", "J15211", "J1529", "J153", "J154", "J155", "J1561", "J1569", "J157", "J158", "J159", "J160"))
#J13     Pneumonia due to Streptococcus pneumoniae
#J14     Pneumonia due to Hemophilus influenzae
#J150    Pneumonia due to Klebsiella pneumoniae
#J151    Pneumonia due to Pseudomonas
#J1520   Pneumonia due to staphylococcus, unspecified
#J15211  Pneumonia due to Methicillin susceptible Staphylococcus aureus
#J15212  Pneumonia due to Methicillin resistant Staphylococcus aureus
#J1529   Pneumonia due to other staphylococcus
#J153    Pneumonia due to streptococcus, group B
#J154    Pneumonia due to other streptococci
#J155    Pneumonia due to Escherichia coli
#J1561   Pneumonia due to Acinetobacter baumannii
#J1569   Pneumonia due to other Gram-negative bacteria
#J157    Pneumonia due to Mycoplasma pneumoniae
#J158    Pneumonia due to other specified bacteria
#J159    Unspecified bacterial pneumonia
#J160    Chlamydial pneumonia
#re-level age
bactpneumo_patients_ed <- bactpneumo_patients_ed |>
  mutate(age.simplified = cut(AGE, 
                              breaks = c(0, 5, 10, 15, 20, seq(30, 100, by = 10)), 
                              right = FALSE, 
                              labels = c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))

#count(bactpneumo_patients_ed, age.simplified)
#count(bactpneumo_patients_ed, SEX)
#count(bactpneumo_patients_ed, DISCHARGE_MONTH)
#count(bactpneumo_patients_ed, DISCHARGE_STATUS)
#visualization
ggplot(bactpneumo_patients_ed, aes(x = SEX)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("Bacterial penumonia cases by sex") +
  theme_bw()

ggplot(bactpneumo_patients_ed, aes(x = age.simplified)) +
  geom_bar() +
  labs(x = "Age", y = "Count") +
    ggtitle("Bacterial penumonia cases by age") +
  theme_bw()

ggplot(bactpneumo_patients_ed, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
    ggtitle("Bacterial penumonia cases by discharge month") +
  theme_bw()

#subset C.diff patients
Cdiff_patients_ed <- filter(alldiag_2020_ed, DX == "A047")
#plot number of cases by demographics
#re-level age
Cdiff_patients_ed <- Cdiff_patients_ed |>
  mutate(age.simplified = cut(AGE, 
                              breaks = c(0, 5, 10, 15, 20, seq(30, 100, by = 10)), 
                              right = FALSE, 
                              labels = c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))

#count(Cdiff_patients_ed, age.simplified)
#count(Cdiff_patients_ed, SEX)
#count(Cdiff_patients_ed, DISCHARGE_MONTH)
#count(Cdiff_patients_ed, DISCHARGE_STATUS)
#visualization
ggplot(Cdiff_patients_ed, aes(x = SEX)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("C. difficile infection by sex") +
  theme_bw()

ggplot(Cdiff_patients_ed, aes(x = age.simplified)) +
  geom_bar() +
  labs(x = "Age", y = "Count") +
  ggtitle("C. difficile infection by age") +
  theme_bw()

ggplot(Cdiff_patients_ed, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  ggtitle("C. difficile infection by discharge month") +
  theme_bw()

#MRSA
mrsa_patients_ed <- filter(alldiag_2020_ed, DX %in% c("A4102", "A4902", "B9562", "J15212"))
#A4102   Sepsis due to Methicillin resistant Staphylococcus aureus
#A4902   Methicillin resistant Staphylococcus aureus infection, unspecified site
#B9562   Methicillin resistant Staphylococcus aureus infection as the cause of diseases classified elsewhere
#J15212  Pneumonia due to Methicillin resistant Staphylococcus aureus

# no MRSA cases in dataset
#enterococcus
entero_patients_ed <- filter(alldiag_2020_ed, DX %in% c("A4181", "B952"))
#A4181   Sepsis due to Enterococcus
#B952    Enterococcus as the cause of diseases classified elsewhere
#plot number of cases by demographics
#re-level age
entero_patients_ed <- entero_patients_ed |>
  mutate(age.simplified = cut(AGE, 
                              breaks = c(0, 5, 10, 15, 20, seq(30, 100, by = 10)), 
                              right = FALSE, 
                              labels = c("less than 5", "5-9", "10-14", "15-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")))

#count(entero_patients_ed, age.simplified)
#count(entero_patients_ed, SEX)
#count(entero_patients_ed, DISCHARGE_MONTH)
#count(entero_patients_ed, DISCHARGE_STATUS)
#visualization
ggplot(entero_patients_ed, aes(x = SEX)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("Enterococcus infections by sex") +
  theme_bw()

ggplot(entero_patients_ed, aes(x = age.simplified)) +
  geom_bar() +
  labs(x = "Age", y = "Count") +
  ggtitle("Enterococcus infections by age") +
  theme_bw()

ggplot(entero_patients_ed, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  ggtitle("Enterococcus infections by discharge month") +
  theme_bw()

#infections from catheter
infcatheter_patients_ed <- filter(alldiag_2020_ed, DX %in% c("T80211A", "T80211D ", "T80211S", "T80212A", "T80212D", "T80212S", "T80218A", "T80218D", "T80218S", "T80219A", "T80219D", "T80219S"))
#T80211A Bloodstream infection due to central venous catheter, initial encounter
#T80211D Bloodstream infection due to central venous catheter, subsequent encounter
#T80211S Bloodstream infection due to central venous catheter, sequela
#T80212A Local infection due to central venous catheter, initial encounter
#T80212D Local infection due to central venous catheter, subsequent encounter
#T80212S Local infection due to central venous catheter, sequela
#T80218A Other infection due to central venous catheter, initial encounter
#T80218D Other infection due to central venous catheter, subsequent encounter
#T80218S Other infection due to central venous catheter, sequela
#T80219A Unspecified infection due to central venous catheter, initial encounter
#T80219D Unspecified infection due to central venous catheter, subsequent encounter
#T80219S Unspecified infection due to central venous catheter, sequela

# no catheter-associated infections in dataset
# surgical site infections
surginf_patients_ed <- filter(alldiag_2020_ed, DX %in% c("T8141XA", "T8141XD", "T8141XS", "T8142XA", "T8142XD", "T8142XS", "T8143XA", "T8143XD", "T8143XS", "O8600", "O8601", "O8602", "O8603", "O8604", "08609"))

#T8141XA Infection following a procedure, superficial incisional surgical site, initial encounter
#T8141XD Infection following a procedure, superficial incisional surgical site, subsequent encounter
#T8141XS Infection following a procedure, superficial incisional surgical site, sequela
#T8142XA Infection following a procedure, deep incisional surgical site, initial encounter
#T8142XD Infection following a procedure, deep incisional surgical site, subsequent encounter
#T8142XS Infection following a procedure, deep incisional surgical site, sequela
#T8143XA Infection following a procedure, organ and space surgical site, initial encounter
#T8143XD Infection following a procedure, organ and space surgical site, subsequent encounter
#T8143XS Infection following a procedure, organ and space surgical site, sequela
#O8600   Infection of obstetric surgical wound, unspecified
#O8601   Infection of obstetric surgical wound, superficial incisional site
#O8602   Infection of obstetric surgical wound, deep incisional site
#O8603   Infection of obstetric surgical wound, organ and space site
#O8604   Sepsis following an obstetrical procedure
#O8609   Infection of obstetric surgical wound, other surgical site

# no cases in dataset
#plot month data together 
covid_month_ed <- ggplot(covid_patients_ed, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  theme_bw()

#adjust margins for cowplot
covid_month_ed <- covid_month_ed + theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

bactpneumo_month_ed <- ggplot(bactpneumo_patients_ed, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  theme_bw()

#adjust margins for cowplot
bactpneumo_month_ed <- bactpneumo_month_ed + theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

cdiff_month_ed <- ggplot(Cdiff_patients_ed, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  theme_bw()

#adjust margins for cowplot
cdiff_month_ed <- cdiff_month_ed + theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

entero_month_ed <- ggplot(entero_patients_ed, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  theme_bw()

entero_month_ed <- entero_month_ed + theme(plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

plot_grid(covid_month_ed, bactpneumo_month_ed, cdiff_month_ed, entero_month_ed, labels = c('COVID-19', 'Bacterial pneumonia', 'C. difficile', 'Enterococcus'), label_size = 12, label_x = -0.05, label_y = 1)

# Compare inpatient and emergency room counts for each

plot_grid(covid_month, covid_month_ed, labels = c('COVID-19 (inpatient)', 'COVID-19 (emergency room)'), label_size = 12, label_x = -0.05, label_y = 1)

# similar trend between inpatient and emergency room (dipped in the summer, surged in the winter)

plot_grid(bactpneumo_month, bactpneumo_month_ed, labels = c('Bact pneumonia (inpatient)', 'Bact pneumonia (ER)'), label_size = 12, label_x = -0.05, label_y = 1)

# similar trend to covid; no difference between inpatient and ER

plot_grid(cdiff_month, cdiff_month_ed, labels = c('C.diff (inpatient)', 'C.diff (emergency room)'), label_size = 12, label_x = -0.05, label_y = 1)

# more seasonality in the emergency room data for C. diff

plot_grid(entero_month, entero_month_ed, labels = c('Enterococcus (inpatient)', 'Enterococcus (ER)'), label_size = 12, label_x = -0.05, label_y = 1)

#ER data seamingly random (surge in June); doesn't seem to coincide with COVID case
  • Overall, the inpatient data and emergency room data follows a similar trend by month. There may be more seasonality in the C. difficile emergency room data however, with a drop in cases between April and August.

4 Results

I chose to include the inpatient data only, because >80% of emergency room visitors were sent home while only 16% of inpatient cases were sent home the same day. Length of stay in the hospital is a major factor in acquiring hospital-associated infection. Overall the month-to-month trends for COVID-19 and hospital infections between inpatient and emergency room visits appeared similar.

prop_ip_routine
  sum(count)
1  0.1610849
# 16% of inpatient visits stayed only 1 day 
prop_ed_routine
  sum(count)
1  0.8077674
# 81% routine to home from emergency deparment 

All 4 infections (COVID-19, bacterial pneumonia, C. difficile, Enterococcus) predominantly affected older populations. For all 4, the highest number of cases was between the ages of 60-69.

plot_grid(covid_age, bactpneumo_age, cdiff_age, entero_age, labels = c('COVID-19', 'Bacterial pneumonia', 'C. difficile', 'Enterococcus'), label_size = 12, label_x = -0.05, label_y = 1)

I also found that there is a negative association between female sex and being diagnosed for COVID-19, bacterial pneumonia, and Enterococcus infection. This is consistent with published reports. COVID-19 infection was more prevalent in men than women in the early phases of the pandemic (Abate et al. 2020) and the severity and outcomes from COVID-19 was worse in men (Pradhan and Olsson. 2020). Men also have been reported to develop pneumonia more frequently than women (Gay et al. 2021). Studies have also shown that bloodstreamn infections with vancomycin-resistant enterococci (VRE) has a male preference (Correa-Martinez et al. 2021).

The incidence of C. difficile meanwhile was not significantly associated with a particular sex. While published reports are mixed, some studies have shown that C. difficile infection encounters are more often female (Young et al. 2022), which is in line with the slightly higher case count in this data set. Together, this analysis shows that the sex-bias in bacterial pneumonia, Enterococcus, and C. difficile infection was not affected by the COVID-19 pandemic in 2020.

plot_grid(covid_sex, bactpneumo_sex, cdiff_sex, entero_sex, label_size = 12, label_x = -0.05, label_y = 1)

As expected, there was a correlation between discharge month (which I used as a proxy for diagnosis month since that wasn’t available) and COVID-19 diagnosis. Relative to April, COVID-19 was negatively associated with the later months. It also appears that there is a seasonality to COVID-19 cases, where cases dropped in the summer months and surged in the winter.

ggplot(covid_patients, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  ggtitle("COVID-19 cases by discharge month") +
  theme_bw()

summary(covid.month.fit)

Call:
glm(formula = COVID ~ DISCHARGE_MONTH, family = binomial, data = covid_binary_month)

Coefficients:
                          Estimate Std. Error  z value Pr(>|z|)    
(Intercept)               -4.33340    0.02855 -151.786  < 2e-16 ***
DISCHARGE_MONTHJanuary    -7.50209    1.00041   -7.499 6.43e-14 ***
DISCHARGE_MONTHFebruary  -16.23267   49.74714   -0.326   0.7442    
DISCHARGE_MONTHMarch      -5.76748    0.44813  -12.870  < 2e-16 ***
DISCHARGE_MONTHMay        -0.60721    0.04544  -13.363  < 2e-16 ***
DISCHARGE_MONTHJune       -1.31565    0.05666  -23.221  < 2e-16 ***
DISCHARGE_MONTHJuly       -1.35050    0.05550  -24.334  < 2e-16 ***
DISCHARGE_MONTHAugust     -1.54094    0.05988  -25.735  < 2e-16 ***
DISCHARGE_MONTHSeptember  -1.90079    0.06933  -27.416  < 2e-16 ***
DISCHARGE_MONTHOctober    -1.36564    0.05500  -24.829  < 2e-16 ***
DISCHARGE_MONTHNovember   -0.58515    0.04379  -13.364  < 2e-16 ***
DISCHARGE_MONTHDecember   -0.09776    0.03829   -2.553   0.0107 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 83171  on 1498051  degrees of freedom
Residual deviance: 76909  on 1498040  degrees of freedom
AIC: 76933

Number of Fisher Scoring iterations: 19

Bacterial pneumonia had a similar seasonality as COVID-19, dropping from June to October and surging in the colder months. Analysis of the association of COVID-19 infection and bacterial pneumonia diagnosis reveals a significant positive association, indicating that COVID-19 infection increases the risk of bacterial pneumonia. Secondary bacterial infection is a important contributor to mortality following viral infection, including COVID-19 (Morens et al. 2008, Gao et al. 2023). Interestingly, the incidence of bacterial pneumonia by age changed throughout 2020, where children under the age of 10 that made up nearly 20% of cases in January significantly decreased, likely due to changes in exposure from the COVID-19 pandemic.

ggplot(bactpneumo_patients, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
    ggtitle("Bacterial penumonia cases by discharge month") +
  theme_bw()

summary(bactpneumo.month.fit)

Call:
glm(formula = bactpneumo ~ factor(DISCHARGE_MONTH), family = binomial, 
    data = bactpneumo_binary_month)

Coefficients:
                          Estimate Std. Error  z value Pr(>|z|)    
(Intercept)               -7.61100    0.07334 -103.774  < 2e-16 ***
factor(DISCHARGE_MONTH)2  -0.29591    0.11460   -2.582 0.009821 ** 
factor(DISCHARGE_MONTH)3   0.12533    0.10358    1.210 0.226310    
factor(DISCHARGE_MONTH)4   0.27074    0.10598    2.555 0.010626 *  
factor(DISCHARGE_MONTH)5   0.08101    0.10749    0.754 0.451060    
factor(DISCHARGE_MONTH)6  -0.36646    0.12065   -3.037 0.002387 ** 
factor(DISCHARGE_MONTH)7  -0.51773    0.12323   -4.201 2.65e-05 ***
factor(DISCHARGE_MONTH)8  -0.41547    0.11996   -3.463 0.000534 ***
factor(DISCHARGE_MONTH)9  -0.70243    0.13258   -5.298 1.17e-07 ***
factor(DISCHARGE_MONTH)10 -0.39638    0.11741   -3.376 0.000735 ***
factor(DISCHARGE_MONTH)11 -0.15140    0.11238   -1.347 0.177909    
factor(DISCHARGE_MONTH)12  0.05973    0.10518    0.568 0.570118    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 29344  on 3980819  degrees of freedom
Residual deviance: 29215  on 3980808  degrees of freedom
AIC: 29239

Number of Fisher Scoring iterations: 11
# Create the plot with labeled factors
ggplot(alldiag_bactpneumo_factor, aes(x = bactpneumo, fill = COVID)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("Negative" = "#aaaaaa", "Positive" = "#b20000")) +
  theme_bw() +
  labs(x = "Bacterial Pneumonia", 
       y = "Proportion", 
       fill = "COVID Status")

summary(covid.bactpneumo.fit)

Call:
glm(formula = bactpneumo ~ COVID, family = binomial, data = alldiag_bactpneumo_binary)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -7.2344     0.1049  -68.99  < 2e-16 ***
COVID         1.6850     0.2262    7.45 9.33e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1865.7  on 132693  degrees of freedom
Residual deviance: 1826.3  on 132692  degrees of freedom
AIC: 1830.3

Number of Fisher Scoring iterations: 10
ggplot(bactpneumo_patients, aes(x = DISCHARGE_MONTH, fill = age.simplified)) +
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Spectral") +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Discharge Month", y = "Proportion", fill = "Age") +
  ggtitle("Bacterial pneumonia cases") +
  theme_bw()

bactpneumo_under10_plot

I next performed a similar analysis for two bacterial pathogens commonly associated with hospital-associated infections outside the lungs. There was no clear seasonality with C. difficile infection; cases remained relatively stable throughout the year. Additionally, COVID-19 infection was not associated with C. difficile infection. A study that performed a single-center comparison in hospital-associated C. difficile infection between 2019 (pre-pandemic) and 2020 (pandemic) also found the incidence remained unchanged; however, their study found that the incidence of hospital-associated C. difficile infection was higher in patients with COVID-19 than in non-COVID-19 cases which they attributed to longer hospital stay and increased use of broad-spectrum antibiotics (Vazquez-Cuesta et al. 2022). The discrepancies between their study and my analysis may be reflective of the limitations of a single-center analysis. Several studies reported a lower C. difficile infection rate compared to pre-pandemic period (Granata et al. 2022).

ggplot(Cdiff_patients, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  ggtitle("C. difficile infection by discharge month") +
  theme_bw()

summary(cdiff.month.fit)

Call:
glm(formula = cdiff ~ factor(DISCHARGE_MONTH), family = binomial, 
    data = cdiff_binary_month)

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -7.161890   0.096711 -74.054   <2e-16 ***
factor(DISCHARGE_MONTH)2  -0.182908   0.146820  -1.246   0.2128    
factor(DISCHARGE_MONTH)3  -0.015104   0.141824  -0.106   0.9152    
factor(DISCHARGE_MONTH)4  -0.150278   0.158077  -0.951   0.3418    
factor(DISCHARGE_MONTH)5  -0.228811   0.153778  -1.488   0.1368    
factor(DISCHARGE_MONTH)6  -0.097025   0.145829  -0.665   0.5058    
factor(DISCHARGE_MONTH)7  -0.287661   0.150062  -1.917   0.0552 .  
factor(DISCHARGE_MONTH)8  -0.315877   0.151853  -2.080   0.0375 *  
factor(DISCHARGE_MONTH)9  -0.003822   0.139506  -0.027   0.9781    
factor(DISCHARGE_MONTH)10 -0.263541   0.147330  -1.789   0.0736 .  
factor(DISCHARGE_MONTH)11 -0.150290   0.145828  -1.031   0.3027    
factor(DISCHARGE_MONTH)12 -0.259636   0.148387  -1.750   0.0802 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16506  on 1498051  degrees of freedom
Residual deviance: 16494  on 1498040  degrees of freedom
  (2482768 observations deleted due to missingness)
AIC: 16518

Number of Fisher Scoring iterations: 10
# Create the plot with labeled factors
ggplot(alldiag_cdiff_factor, aes(x = Cdiff, fill = COVID)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("Negative" = "#aaaaaa", "Positive" = "#b20000")) +
  theme_bw() +
  labs(x = "C. difficile infection", 
       y = "Proportion", 
       fill = "COVID Status")

summary(covid.cdiff.fit)

Call:
glm(formula = Cdiff ~ COVID, family = binomial, data = alldiag_cdiff_binary)

Coefficients:
            Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -4.89046    0.03270 -149.548   <2e-16 ***
COVID        0.03812    0.14568    0.262    0.794    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11690  on 132693  degrees of freedom
Residual deviance: 11690  on 132692  degrees of freedom
AIC: 11694

Number of Fisher Scoring iterations: 7

Similarly, the incidence of Enterococcus infection did not have a clear seasonality and there was a lack of association between COVID-19 diagnosis and Enterococcus co-infection.

ggplot(entero_patients, aes(x = DISCHARGE_MONTH)) +
  geom_bar() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Count") +
  ggtitle("Enterococcus infections by discharge month") +
  theme_bw()

summary(entero.month.fit)

Call:
glm(formula = entero ~ factor(DISCHARGE_MONTH), family = binomial, 
    data = entero_binary_month)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -8.88586    0.13868 -64.072   <2e-16 ***
factor(DISCHARGE_MONTH)2  -0.16754    0.20887  -0.802   0.4225    
factor(DISCHARGE_MONTH)3  -0.04716    0.20485  -0.230   0.8179    
factor(DISCHARGE_MONTH)4  -0.07022    0.22057  -0.318   0.7502    
factor(DISCHARGE_MONTH)5   0.19973    0.19709   1.013   0.3109    
factor(DISCHARGE_MONTH)6  -0.11961    0.21184  -0.565   0.5723    
factor(DISCHARGE_MONTH)7   0.15738    0.19260   0.817   0.4139    
factor(DISCHARGE_MONTH)8  -0.41638    0.22692  -1.835   0.0665 .  
factor(DISCHARGE_MONTH)9   0.01575    0.20128   0.078   0.9376    
factor(DISCHARGE_MONTH)10 -0.02964    0.20017  -0.148   0.8823    
factor(DISCHARGE_MONTH)11  0.02457    0.20242   0.121   0.9034    
factor(DISCHARGE_MONTH)12  0.13403    0.19520   0.687   0.4923    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10733  on 3980819  degrees of freedom
Residual deviance: 10721  on 3980808  degrees of freedom
AIC: 10745

Number of Fisher Scoring iterations: 12
# Create the plot with labeled factors
ggplot(alldiag_entero_factor, aes(x = Entero, fill = COVID)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("Negative" = "#aaaaaa", "Positive" = "#b20000")) +
  theme_bw() +
  labs(x = "Enterococcus infection", 
       y = "Proportion", 
       fill = "COVID Status")

summary(covid.entero.fit)

Call:
glm(formula = Entero ~ COVID, family = binomial, data = alldiag_entero_binary)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.21056    0.06293 -98.686   <2e-16 ***
COVID        0.21349    0.25810   0.827    0.408    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3873.6  on 132693  degrees of freedom
Residual deviance: 3873.0  on 132692  degrees of freedom
AIC: 3877

Number of Fisher Scoring iterations: 9

5 Conclusion

Overall my findings indicate that the COVID-19 pandemic did not significantly influence the incidence of two common hospital-acquired infections (C. difficile and Enterococcus) in the early phase of the pandemic. As expected, bacterial pneumonia which is commonly associated with respiratory viral infections had a similar seasonality as COVID-19 and was significantly associated with COVID-19 infection.

This study was limited by the data set available. Ideally, I would’ve liked to compare the results from 2020 to pre-pandemic levels (2019) and post-pandemic data (2022-). However, these data sets were restricted and not publicly available. Another limitation of this data set was the lack of incidence of other common hospital-acquired infections (e.g. central-line bloodstream infections, catheter-associated UTIs, and MRSA). For example, a study by the CDC found that the incidence of central-line bloodstream infections, catheter-associated UTIs, ventilator-associated infections, and MRSA bacteremia was significantly higher in 2021 than pre-pandemic levels (Lastinger et al. 2022). It would have also been nice to have geographical data as well, since certain regions were more heavily affected by the pandemic than others.

Sources:

Pradhan and Olsson. (2020) Sex differences in severity and mortality from COVID-19: are males more vulnerable? Biology of Sex Differences. https://bsd.biomedcentral.com/articles/10.1186/s13293-020-00330-7

Abate et al. (2020) Sex difference in coronavirus disease (COVID-19): a systematic review and meta-analysis. BMJ Open. https://bmjopen.bmj.com/content/10/10/e040129

Gay et al. (2021) Sexual Dimorphism and Gender in Infectious Diseases. Frontiers in Immunology. https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2021.698121/full

Correa-Martinez et al. (2021) Sex differences in vancomycin-resistant enterococci bloodstream infections—a systematic review and meta-analysis. Biology of Sex Differences. https://bsd.biomedcentral.com/articles/10.1186/s13293-021-00380-5

Young et al. (2022) Clostridioides difficile Infection Treatment and Outcome Disparities in a National Sample of United States Hospitals. Antibiotics. https://www.mdpi.com/2079-6382/11/9/1203

Morens et al. (2008) Predominant role of bacterial pneumonia as a cause of death in pandemic influenza: implications for pandemic influenza preparedness. J Infect Dis.

Gao et al. (2023) Machine learning links unresolving secondary pneumonia to mortality in patients with severe pneumonia, including COVID-19. JCI. https://www.jci.org/articles/view/170682#B17

Vazquez-Cuesta et al. (2022) Clostridioides difficile infection epidemiology and clinical characteristics in COVID-19 pandemic. Front. Med. https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2022.953724/full#B4

Granata et al. (2022) The burden of Clostridioides difficile infection in COVID-19 patients: A systematic review and meta-analysis. Anaerobe. https://www.sciencedirect.com/science/article/pii/S1075996421001670?casa_token=8a_HWG8ihJoAAAAA:mR-h9srMVlu0YMgt7sljtRPskR7M36TMpuaKKY8IlpW0tqyo507Ph1UHeMGFgVu3Xoa8oL1ebQ

Lastinger et al. (2022) Continued increases in the incidence of healthcare-associated infection (HAI) during the second year of the coronavirus disease 2019 (COVID-19) pandemic. Infection Control & Hospital Epidemiology. https://www.cambridge.org/core/journals/infection-control-and-hospital-epidemiology/article/continued-increases-in-the-incidence-of-healthcareassociated-infection-hai-during-the-second-year-of-the-coronavirus-disease-2019-covid19-pandemic/6521FF7F27B97AD5B21BA4B8EA9888A4